Accelerated Likelihood Maximization for Diffusion-based Versatile Content Generation

ICLR 2026 Submission
Anonymous Authors

Applications of ALM

ALM teaser image.

Accelerated Likelihood Maximization (ALM) tackles diverse versatile content generation tasks. By explicitly maximizing the likelihood of the unobserved regions given the observed content, ALM achieves state-of-the-art performance across diverse applications, effectively alleviating the limitations of prior approaches.

TL;DR

Accelerated Likelihood Maximization is a tailored strategy for versatile content generation, supporting various inpainting and outpainting scenarios: image inpainting, wide image generation, human motion completion, and long video generation.

Abstract

Generating diverse, coherent, and plausible content from partially given inputs remains a significant challenge for pretrained diffusion models. Existing approaches face clear limitations: training-based approaches offer strong task-specific results but require costly data and computation, and they generalize poorly across tasks. Training-free paradigms are more efficient and broadly applicable, but often fail to produce globally consistent results, as they usually enforce constraints only on observed regions. To address these limitations, we introduce Accelerated Likelihood Maximization (ALM), a novel training-free sampling strategy integrated into the reverse process of diffusion models. ALM explicitly optimizes the unobserved regions by jointly maximizing both conditional and joint likelihoods. This ensures that the generated content is not only faithful to the given input but also globally coherent and plausible. We further incorporate an acceleration mechanism to enable efficient computation. Experimental results demonstrate that ALM consistently outperforms state-of-the-art methods in various data domains and tasks, establishing a powerful, training-free paradigm for versatile content generation.
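The core idea of optimizing the unobserved region while keeping the observed content fixed can be illustrated with a toy sketch. The code below is not the ALM objective; it is a generic gradient-ascent update under a simple diagonal-Gaussian surrogate, with all names (`gaussian_logpdf_grad`, `likelihood_ascent_step`, `mu`, `sigma`) introduced here for illustration only.

```python
import numpy as np

def gaussian_logpdf_grad(x, mu, sigma):
    """Gradient of a diagonal-Gaussian log-density with respect to x."""
    return -(x - mu) / sigma**2

def likelihood_ascent_step(x, observed_mask, mu, sigma, lr=0.1):
    """One gradient-ascent update applied only to unobserved entries.

    Observed entries stay fixed, mirroring the idea of optimizing the
    unobserved region while remaining faithful to the given content.
    """
    grad = gaussian_logpdf_grad(x, mu, sigma)
    return x + lr * grad * (~observed_mask)  # unobserved entries only

# Toy run: half the vector is observed, the rest is optimized.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
observed = np.array([True] * 4 + [False] * 4)
mu, sigma = np.zeros(8), np.ones(8)
x_obs_before = x[observed].copy()
for _ in range(200):
    x = likelihood_ascent_step(x, observed, mu, sigma)
# Unobserved entries converge toward the surrogate mode (mu = 0);
# observed entries remain untouched.
```

In ALM this kind of update is interleaved with the reverse diffusion steps of a pretrained model, rather than run against a fixed closed-form density as here.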

Image Inpainting

We present qualitative results of ALM on the image inpainting task. First, we compare our method with state-of-the-art image inpainting approaches, using the pretrained Stable Diffusion model for the experiment. As shown in the figure, the proposed method achieves superior performance across diverse image inpainting scenarios.

Second, we show diverse examples of images inpainted with our method.

Lastly, we demonstrate that our method is robust to the choice of backbone by providing image examples generated with (1) a pretrained unconditional diffusion model and (2) the pretrained Stable Diffusion XL.

Comparisons with Baselines

Additional Inpainting Results

Experiment across Diverse Backbone Architectures

Wide Image Generation

We show that our method handles the wide image generation task by leveraging autoregressive image outpainting. The following figure compares the proposed method with state-of-the-art wide image generation methods, including SyncTweedies and SyncSDE. While the baseline algorithms generate images with blurry regions or artifacts, ALM generates high-quality wide images, demonstrating its superior performance.
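Autoregressive outpainting of a wide canvas can be scheduled as a sequence of overlapping windows, each conditioned on content already generated. A minimal sketch follows; the window width, stride, and canvas width are illustrative assumptions, not the paper's settings.

```python
def outpaint_windows(canvas_w, window_w, stride):
    """Left edges of overlapping windows covering a wide canvas.

    The first window is generated unconditionally; each later window
    overlaps the previous one by (window_w - stride) pixels, which serve
    as the observed region for the next outpainting step.
    """
    if window_w > canvas_w:
        raise ValueError("canvas narrower than one window")
    lefts = list(range(0, canvas_w - window_w, stride))
    lefts.append(canvas_w - window_w)  # snap final window to the right edge
    return lefts

# Example: a 2048-px-wide canvas tiled by 512-px windows with 256-px stride.
print(outpaint_windows(2048, 512, 256))
# → [0, 256, 512, 768, 1024, 1280, 1536]
```

Each window after the first is an inpainting problem in which the overlapped strip is the observed content, so the same likelihood-maximization sampler applies unchanged.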

We additionally visualize examples of wide images generated with diverse text prompts.

Comparisons with Baselines

Additional Results

Long Video Generation

We present qualitative results of ALM on long video generation. As in wide image generation, we autoregressively generate short video sequences and concatenate them into a long video sequence. We use the pretrained text-to-video model LaVie to generate 16-frame (short) videos, then extend them to a 104-frame (long) video. Our method effectively generates long video sequences conditioned on diverse text prompts.
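The 16-frame-to-104-frame extension implies a fixed autoregressive schedule. The chunk length (16) and target length (104) come from the text above; the 8-frame conditioning overlap in this sketch is an illustrative assumption (any overlap for which 104 − 16 is divisible by 16 − overlap would also fit).

```python
def extension_steps(total, chunk, overlap):
    """Number of autoregressive steps needed to extend a first
    `chunk`-frame clip to `total` frames, re-conditioning on
    `overlap` frames of the previous clip at every step."""
    new_per_step = chunk - overlap
    remaining = total - chunk
    if remaining % new_per_step != 0:
        raise ValueError("target length unreachable with this overlap")
    return remaining // new_per_step

# 16-frame chunks with an assumed 8-frame overlap: 16 + 11 * 8 = 104 frames.
print(extension_steps(104, 16, 8))  # → 11
```

As with wide images, each step is an inpainting problem along the time axis: the overlapped frames are the observed content for the next chunk.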

Results

“Macro shot of smoke curling upward in the air, high contrast background, slow motion.”

“Macro shot of bubbles rising in sparkling water, cinematic lighting, 8K.”

“Wide shot of waves shimmering under sunlight, high dynamic range, 4K clarity.”

“Wide shot of fire glowing in a rustic cabin fireplace, cozy cinematic look.”

“Macro close-up of glitter slowly falling through water, sparkling particles, 8K.”

“Slow zoom onto a sparkler fizzing softly in the dark, glowing particles, ultra HDR.”

Human Motion Inpainting

We show an additional application of ALM to the 3D human motion inpainting task. We evaluate performance across two distinct scenarios: "first-half prediction," where the task is to predict the initial part of a sequence given only the latter half, and "middle-half prediction," where the model must fill in the central portion given the first and last quarters. For each sequence, the observed content is shown in orange, while the content generated by ALM is shown in blue. The proposed method shows outstanding performance on human motion completion.
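The two evaluation scenarios correspond to simple observed-frame masks over a motion sequence of length T. The sketch below makes them concrete; the function name and frame indexing convention are illustrative.

```python
def motion_masks(T):
    """Observed-frame masks (True = observed) for the two scenarios.

    first_half:  observe the latter half, predict the initial half.
    middle_half: observe the first and last quarters, fill the center.
    """
    q = T // 4
    first_half = [t >= T // 2 for t in range(T)]
    middle_half = [t < q or t >= T - q for t in range(T)]
    return first_half, middle_half

fh, mh = motion_masks(8)
print(fh)  # frames 4..7 observed
print(mh)  # frames 0..1 and 6..7 observed
```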

First-half Prediction

“person walks forward and stops”

“someone nervously pacing around in a circle”

“a man takes several steps forward, jumps over an imaginary obstacle, lands on both feet, takes several more steps forward, turns around 180 degrees, takes several steps forward and jumps over same obstacle and returns to start.”

“the person is walking back-and-forth in a zigzag.”



Middle-half Prediction

“a man walks to the left side and stands.”

“someone nervously pacing around in a circle”

“forward walking and it ends”

“the person is walking back-and-forth in a zigzag.”

Ablation Study

We first show the effectiveness of ALM through an ablation study. Without ALM, i.e., when following the core principle of SyncSDE, which models only the correlation between the unobserved variable and the observed content in the unobserved region, the method fails to generate plausible content. In contrast, ALM generates visually coherent outputs, clearly demonstrating the effectiveness of the proposed method.

Effects of ALM

We further analyze the effects of the conditional likelihood term and the joint likelihood term in the optimization objective (Eq. (4) of the main paper). As shown in the figure, both terms play significant roles in generating high-quality results.

Effects of Each Likelihood Term